Skip to content

Conversation

@jeynmann
Copy link
Contributor

@jeynmann jeynmann commented Nov 6, 2025

What?

Ring doorbell in atomic way.
No need cas wait on db lock.
Based on PR10959

case_name msg_size (KB) Version Latency (us) BW (MB/s) Ops (msg/s) Speedup
ucp_put_single_bw 4 pr 30.219 4136.46 1.05893e+06 1.05x ⬆️
ucp_put_single_bw 4 master 31.728 3939.7 1.00856e+06 -
ucp_put_single_lat 4 pr 80.227 1558.07 398866 0.99x
ucp_put_single_lat 4 master 79.722 1567.94 401394 -
ucp_put_multi_bw 4 pr 142.102 879.65 225191 1.01x
ucp_put_multi_bw 4 master 142.997 874.14 223780 -
ucp_put_multi_lat 4 pr 79.876 1564.92 400620 0.99x
ucp_put_multi_lat 4 master 79.151 1579.25 404289 -
ucp_put_partial_bw 4 pr 142.047 879.99 225278 1.00x
ucp_put_partial_bw 4 master 142.199 879.05 225036 -
ucp_put_partial_lat 4 pr 79.968 1563.12 400159 0.99x
ucp_put_partial_lat 4 master 79.3 1576.28 403529 -
ucp_put_single_bw 8 pr 30.99 8067.14 1.03259e+06 1.05x ⬆️
ucp_put_single_bw 8 master 32.441 7706.19 986393 -
ucp_put_single_lat 8 pr 81.549 3065.62 392400 1.00x
ucp_put_single_lat 8 master 81.646 3061.99 391935 -
ucp_put_multi_bw 8 pr 148.761 1680.55 215111 1.00x
ucp_put_multi_bw 8 master 148.697 1681.27 215203 -
ucp_put_multi_lat 8 pr 83.206 3004.6 384589 1.01x
ucp_put_multi_lat 8 master 84.064 2973.94 380664 -
ucp_put_partial_bw 8 pr 148.644 1681.87 215279 1.00x
ucp_put_partial_bw 8 master 148.697 1681.27 215203 -
ucp_put_partial_lat 8 pr 83.283 3001.8 384231 1.00x
ucp_put_partial_lat 8 master 83.58 2991.13 382865 -

Why?

Optimize performance for doorbell.

How?

Merge 2 stage db into 1.

Summary by CodeRabbit

  • Refactor
    • Restructured internal database operation handling and synchronization logic to optimize performance characteristics.

@coderabbitai
Copy link

coderabbitai bot commented Nov 6, 2025

Walkthrough

The function uct_rc_mlx5_gda_db has been reworked to replace a locked/inline doorbell update mechanism with a guarded-reservation approach. New logic computes a skip_db flag and routes execution through either an atomic reserve path or a wait-and-ring-db path with explicit DBR ring operations.

Changes

Cohort / File(s) Summary
GDA Doorbell Mechanism Refactor
src/uct/ib/mlx5/gdaki/gdaki.cuh
Reworked uct_rc_mlx5_gda_db to introduce conditional skip-path logic. Added wqe_next computation and skip_db flag determination. If skip_db is true, perform atomic compare_exchange reserve; otherwise, wait for sq_ready_index alignment and execute ring DB operations with DBR updates. Removed previous coarse-grained locking and inline DB update sequence.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Areas requiring extra attention:

  • Atomic compare_exchange logic and memory ordering semantics for correctness
  • Thread-safety of the skip_db flag computation and conditional paths
  • Doorbell ring operation sequence and DBR update ordering
  • Interaction between wqe_base, wqe_next, and sq_ready_index state management

Poem

🐰 A guard at the gate, a path skipped with care,
Compare-and-exchange floats through the air,
Ring-ring-ring bells in a critical dance,
Doorbells optimized—the code took its chance!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'UCT/DEVICE: ring db in atomic way' clearly describes the main change—implementing atomic doorbell ring operations, which aligns with the PR's core objective to eliminate CAS waits on db locks.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 25768e8 and 86f1972.

📒 Files selected for processing (1)
  • src/uct/ib/mlx5/gdaki/gdaki.cuh (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: UCX PR (Static_check Static checks)
  • GitHub Check: UCX PR (Codestyle ctags check)
  • GitHub Check: UCX PR (Codestyle format code)
  • GitHub Check: UCX PR (Codestyle commit title)
  • GitHub Check: UCX PR (Codestyle AUTHORS file update check)
  • GitHub Check: UCX PR (Codestyle codespell check)
  • GitHub Check: UCX release DRP (Prepare CheckRelease)
  • GitHub Check: UCX release (Prepare CheckRelease)
  • GitHub Check: UCX snapshot (Prepare Check)

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87e88f7 and 25768e8.

📒 Files selected for processing (6)
  • src/uct/ib/mlx5/gdaki/gdaki.c (2 hunks)
  • src/uct/ib/mlx5/gdaki/gdaki.cuh (3 hunks)
  • src/uct/ib/mlx5/gdaki/gdaki_dev.h (0 hunks)
  • test/gtest/ucp/cuda/test_kernels.cu (0 hunks)
  • test/gtest/ucp/cuda/test_kernels.h (0 hunks)
  • test/gtest/ucp/test_ucp_device.cc (0 hunks)
💤 Files with no reviewable changes (4)
  • test/gtest/ucp/cuda/test_kernels.cu
  • src/uct/ib/mlx5/gdaki/gdaki_dev.h
  • test/gtest/ucp/cuda/test_kernels.h
  • test/gtest/ucp/test_ucp_device.cc

@jeynmann jeynmann changed the title [DNM][WIP] UCT/DEVICE: ring db in atomic way UCT/DEVICE: ring db in atomic way Nov 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants